feat(bootstrap,cli): switch GPU injection to CDI where supported #495
Conversation
Use an explicit CDI device request (`driver="cdi"`, `device_ids=["nvidia.com/gpu=all"]`) when the Docker daemon reports CDI spec directories via `GET /info` (`SystemInfo.CDISpecDirs`). This makes device injection declarative and decouples spec generation from consumption.

When the daemon reports no CDI spec directories, fall back to the legacy NVIDIA device request (`driver="nvidia"`, `count=-1`), which relies on the NVIDIA Container Runtime hook. Failure modes for both paths are equivalent: a missing or stale NVIDIA Container Toolkit installation will cause container start to fail.

CDI spec generation is out of scope for this change; specs are expected to be pre-generated out-of-band, for example by the NVIDIA Container Toolkit.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
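The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `DeviceRequest` struct and `gpu_device_request` function here are hypothetical stand-ins for whatever Docker client types the crate actually uses, and the assumption is that `count` is ignored by the daemon when explicit `device_ids` are supplied.

```rust
// Illustrative sketch only: type and function names are hypothetical,
// not the actual Docker client API used by the crate.
#[derive(Debug, PartialEq)]
struct DeviceRequest {
    driver: String,
    count: i64,
    device_ids: Vec<String>,
}

/// Build the GPU device request. `cdi_spec_dirs` would come from the
/// daemon's `GET /info` response (`SystemInfo.CDISpecDirs`).
fn gpu_device_request(cdi_spec_dirs: &[String]) -> DeviceRequest {
    if !cdi_spec_dirs.is_empty() {
        // Declarative CDI injection: the daemon resolves the device name
        // against pre-generated CDI specs.
        DeviceRequest {
            driver: "cdi".to_string(),
            count: 0, // assumed ignored when device_ids are explicit
            device_ids: vec!["nvidia.com/gpu=all".to_string()],
        }
    } else {
        // Legacy path: the NVIDIA Container Runtime hook injects devices.
        DeviceRequest {
            driver: "nvidia".to_string(),
            count: -1, // -1 conventionally means "all GPUs"
            device_ids: vec![],
        }
    }
}

fn main() {
    let cdi = gpu_device_request(&["/etc/cdi".to_string()]);
    let legacy = gpu_device_request(&[]);
    println!("cdi driver={}, legacy driver={}", cdi.driver, legacy.driver);
}
```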
The `--gpu` flag on `gateway start` now accepts an optional value:

- `--gpu` — auto-select: CDI on Docker >= 28.2.0, legacy otherwise
- `--gpu=legacy` — force the legacy nvidia `DeviceRequest` (`driver="nvidia"`)

Internally, the `gpu` bool parameter to `ensure_container` is replaced with a `device_ids` slice. `resolve_gpu_device_ids` resolves the `"auto"` sentinel to a concrete device ID list based on the Docker daemon version, keeping the resolution logic in one place at deploy time.

Signed-off-by: Evan Lezar <elezar@nvidia.com>
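The resolution step could look something like the sketch below. Only the function name `resolve_gpu_device_ids`, the `"auto"` sentinel, and the 28.2.0 threshold come from the commit message; the signature, the version parsing, and the convention that an empty ID list means "use the legacy request" are illustrative assumptions.

```rust
// Sketch of deploy-time resolution; signature and parsing are assumptions,
// not the crate's actual code.

/// Parse a "major.minor.patch" Docker server version string.
fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let mut parts = v.split('.').map(|p| p.parse::<u32>().ok());
    Some((parts.next()??, parts.next()??, parts.next()??))
}

/// Resolve the "auto" sentinel to a concrete device ID list.
/// CDI has been enabled by default since Docker 28.2.0; older daemons
/// fall back to the legacy nvidia DeviceRequest (empty list here, by
/// this sketch's convention).
fn resolve_gpu_device_ids(mode: &str, daemon_version: &str) -> Vec<String> {
    match mode {
        "auto" => match parse_version(daemon_version) {
            Some(v) if v >= (28, 2, 0) => vec!["nvidia.com/gpu=all".to_string()],
            _ => vec![],
        },
        // "legacy" (or anything unrecognized) keeps the legacy path.
        _ => vec![],
    }
}

fn main() {
    println!("28.2.0 -> {:?}", resolve_gpu_device_ids("auto", "28.2.0"));
    println!("27.5.1 -> {:?}", resolve_gpu_device_ids("auto", "27.5.1"));
}
```

Keeping this in one function means the CLI only ever passes a mode string, and the daemon-version probe happens once at deploy time.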
| Flag | Behavior |
| --- | --- |
| `--gpu` | Auto-select: CDI when enabled on the daemon, `--gpus all` otherwise |
| `--gpu=legacy` | Force `--gpus all` |
This simplification looks great. I was planning to comment and iterate on this, and noticed it had already been updated.
For some extra context: one thing to keep in mind is that we won't depend on Docker in the long term (Drew is working on a VM-based deployment mode), so it's good to keep the option surface small so we don't have to deprecate/remove options in the VM.
On that note, we are thinking that removing the legacy option makes sense as well. CDI has been enabled by default since Docker 28.2 (released in May 2025). If we get reports that it's needed for some reason, it's easy to add back then, and if it's not needed, we won't have to deprecate it with the switch to the VM.
There is one more spot that needs to be updated with this change: https://github.com/NVIDIA/nv-agent-env/blob/1f2a85e873a77ebb38fb492062f9fc936617f08a/crates/openshell-cli/src/main.rs#L1104-L1110
Summary
Switch GPU device injection in cluster bootstrap to use CDI (Container Device Interface) when it is enabled in Docker (the `docker info` endpoint returns a non-empty list of CDI spec directories). When this is not the case, the existing `--gpus all` NVIDIA `DeviceRequest` path is used as a fallback. The `--gpu` flag on `gateway start` is extended to let users force the legacy injection mode.
Related Issue
Part of #398
Changes
- feat(bootstrap): Auto-select CDI (`driver="cdi"`, `device_ids=["nvidia.com/gpu=all"]`) if CDI is enabled on the daemon; fall back to legacy `driver="nvidia"` on older daemons or when CDI spec dirs are absent
- feat(cli): `--gpu` now accepts an optional value: omit for auto-select, `--gpu=legacy` to force the legacy `--gpus all` path
- test(e2e): Extend `gateway start` help smoke test to cover `--gpu` and `--recreate` flags

Testing
- `mise run pre-commit` passes
- Unit tests added (`resolve_gpu_device_ids` coverage)

Checklist